About Seurat and SeuratData

Seurat is an R package designed for QC, analysis, and exploration of single-cell RNA-seq data. Seurat aims to enable users to identify and interpret sources of heterogeneity from single-cell transcriptomic measurements, and to integrate types of single-cell data. After this short introduction workshop you can read Seurat offical website to dive a bit deeper.

SeuratData is a mechanism for distributing datasets in the form of Seurat objects using R’s internal package and data management systems. It represents an easy way for users to get access to datasets that are used in the Seurat vignettes.

Install Seurat and SeuratData

(If you have installed them before workshop, you do not need to run this block of code.)

install.packages('Seurat')

if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}

remotes::install_github("satijalab/seurat-data", quiet = TRUE)

1. Load Seurat and SeuratData

We will use Seurat V5, which was published last year. Seurat V5 has gradually gained popularity due to its faster running speed. However, Seurat V5 has some data structure changes compared with older versions (V3 & V4), which may cause some old codes to fail to run. More details can be found on this website.

library(Seurat)
options(Seurat.object.assay.version = "v5")
library(SeuratData)

Load the data

We will use the pbmcsca dataset. This public dataset includes single-cell RNA-seq data of human peripheral blood mononuclear cells (PBMCs) using multiple sequencing platforms. Only data that passed quality control are included in the pbmcsca dataset.

data("pbmcsca")
pbmcsca <- UpdateSeuratObject(pbmcsca)
## Warning: Assay RNA changing from Assay to Assay
table(pbmcsca$Method)
## 
##   10x Chromium (v2) 10x Chromium (v2) A 10x Chromium (v2) B   10x Chromium (v3) 
##                3362                3222                3222                3222 
##            CEL-Seq2            Drop-seq             inDrops            Seq-Well 
##                 526                6584                6584                3773 
##          Smart-seq2 
##                 526

The table indicates the number of cells sequenced using different platforms.

Today we will consider two scRNA-seq sequencing results (10x Chromium (v2) & 10x Chromium (v3)). PBMCs were sequnced from two patients using different versions of the sequencing platform.

Extract raw counts

The raw count matrix and the information of each gene and each cell are saved in a Seurat object pbmc_10x_v2 and pbmc_10x_v3 independently. In addition, we combine the two sequencing results without any processing and store them in the Seurat object pbmc_combo:

pbmc_10x_v2 <- pbmcsca[,pbmcsca$Method == "10x Chromium (v2)"]
pbmc_10x_v3 <- pbmcsca[,pbmcsca$Method == "10x Chromium (v3)"]
pbmc_combo <- pbmcsca[,pbmcsca$Method %in% c("10x Chromium (v2)", "10x Chromium (v3)")]

Data Structure of a Seurat object

The Seurat object is a representation of single-cell expression data for R; each Seurat object revolves around a set of cells and consists of one or more Assay objects, or individual representations of expression data (eg. RNA-seq, ATAC-seq, etc). These assays can be reduced from their high-dimensional state to a lower-dimension state and stored as DimReduc objects. Seurat objects also store additional metadata, both at the cell and feature level (contained within individual assays). The object was designed to be as self-contained as possible, and easily extendable to new methods.

We use Seurat object pbmc_10x_v3 as an example.

The raw count matrix of scRNA-seq experiment is here:

pbmc_10x_v3@assays$RNA

**ⓘ Count matrix in Seurat*
A count matrix from a Seurat object displays the genes in rows and the cells in columns.

The information and labels of each cell is here:

pbmc_10x_v3@meta.data

And the dimensional reduction information will be saved here:

pbmc_10x_v3@reductions

2. Analysis of single-cell RNA-seq data from a single experiment

Let’s start with a simple case: the data generated using the the 10x Chromium (v3) platform (i.e the Seurat object pbmc_10x_v3.

Let’s first take a look at how many cells and genes passed Quality Control (QC).

**ⓘ Count matrix in Seurat*
A count matrix from a Seurat object displays the genes in rows and the cells in columns.

dim(pbmc_10x_v3)
## [1] 33694  3222

Here we have 3222 cells with 33694 genes that passed QC.

Normalization

We use the Seurat function NormalizeData() to normalize raw counts. By default, Seurat implements a global-scaling normalization method called LogNormalize that normalizes the gene expression measurements for each cell by the total number of counts accross all genes, and multiplies this by a scaling factor (10,000 by default), then log transforms the result:

pbmc_10x_v3 <- NormalizeData(object=pbmc_10x_v3, normalization.method = "LogNormalize", 
    scale.factor = 10000)

Feature Selection

We use the Seurat function FindVariableFeatures() to select highly variable genes which have most of useful information for downstream analysis.

Here we select the top 3,000 most variable genes to save some computing time. In practice, you can select more genes (5,000 or more) to preserve more information from the scRNA-seq experiment:

pbmc_10x_v3 <- FindVariableFeatures(pbmc_10x_v3, selection.method = "vst", nfeatures = 3000)

Scaling

The single cell dataset likely may contain ‘uninteresting’ sources of variation, for example technical noise, batch effects, or even biological sources of variation (cell cycle stage). As suggested by Buettner et al, 2015, regressing these signals out of the analysis can improve downstream dimensionality reduction and clustering.

Seurat constructs linear models to predict gene expression based on user-defined variables. The scaled z-scored residuals of these models are stored in the scale.data slot, and are used for dimensionality reduction and clustering.

We use the Seurat function ScaleData() to obtain the scaled matrix:

pbmc_10x_v3.all.genes <- rownames(pbmc_10x_v3)
pbmc_10x_v3 <- ScaleData(pbmc_10x_v3, features = pbmc_10x_v3.all.genes)

Principal component analysis (PCA)

We perform PCA on the scaled data. By default, the genes in pbmc_10x_v3@var.genes are used as input, but they can be defined by specifying the argument pc.genes.

We have found that running dimensionality reduction on highly variable genes can improve performance. However, with UMI data – particularly after regressing out technical variables, we often see that PCA returns similar (albeit slower) results when run on much larger subsets of genes, including the whole transcriptome.

We run PCA on top 3,000 most variable genes:

(The PCA result will be stored in the pbmc_10x_v3@reduction. The output information will tell you which genes have high correlation (also positive or negative) with top 5 principal components)

pbmc_10x_v3 <- RunPCA(pbmc_10x_v3, features = VariableFeatures(object = pbmc_10x_v3))
## PC_ 1 
## Positive:  LYZ, FCN1, CLEC7A, CPVL, SERPINA1, SPI1, S100A9, AIF1, NAMPT, CSTA 
##     CTSS, MAFB, MPEG1, NCF2, FGL2, VCAN, S100A8, TYMP, CST3, LST1 
##     CYBB, CFD, FCER1G, TGFBI, SLC11A1, GRN, CD14, SLC7A7, PSAP, RNF130 
## Negative:  IL32, CCL5, TRBC2, TRAC, CST7, CD69, RORA, CTSW, CD247, ITM2A 
##     TRBC1, C12orf75, IL7R, CD8A, GZMA, NKG7, CD7, LDHB, GZMH, CD6 
##     CD8B, PRF1, BCL11B, LYAR, FGFBP2, HOPX, LTB, TCF7, KLRD1, ZFP36L2 
## PC_ 2 
## Positive:  NRGN, PF4, SDPR, HIST1H2AC, MAP3K7CL, GNG11, PPBP, GPX1, TUBB1, SPARC 
##     CLU, PGRMC1, RGS18, FTH1, MARCH2, TREML1, HIST1H3H, ACRBP, AP003068.23, NCOA4 
##     PRKAR2B, CMTM5, CD9, CA2, TAGLN2, TSC22D1, CTTN, HIST1H2BJ, MTURN, TMSB4X 
## Negative:  EEF1A1, TMSB10, RPS2, RPS12, RPS18, RPLP1, RPS8, IL32, S100A4, RPLP0 
##     PFN1, NKG7, CYBA, ARL4C, HSPA8, CST7, HSP90AA1, ZFP36L2, S100A6, ANXA1 
##     CTSW, CORO1A, LDHA, CD247, GZMA, CALR, YBX1, S100A10, CFL1, ARPC3 
## PC_ 3 
## Positive:  CCL5, TMSB4X, SRGN, NKG7, ACTB, CST7, S100A4, CTSW, ANXA1, GZMH 
##     FGFBP2, PRF1, GZMA, IL32, GZMB, C12orf75, KLRD1, GNLY, GAPDH, CD247 
##     MYO1F, NRGN, ID2, ACTG1, CD7, PF4, SDPR, PPBP, HIST1H2AC, CD8A 
## Negative:  CD79A, HLA-DQA1, MS4A1, BANK1, IGHM, LINC00926, IGHD, TNFRSF13C, HLA-DQB1, CD74 
##     IGKC, HLA-DRA, BLK, CD83, ADAM28, CD22, HLA-DRB1, CD37, NFKBID, CD79B 
##     P2RX5, VPREB3, IGLC2, JUND, FCER2, TCOF1, GNG7, COBLL1, RALGPS2, CD19 
## PC_ 4 
## Positive:  IL7R, S100A12, VCAN, S100A8, SLC2A3, CD14, CSF3R, S100A9, MS4A6A, CD93 
##     IER3, FOS, RCAN3, ZFP36L2, EGR1, RP11-1143G9.4, RGCC, MGST1, LEPROTL1, CLEC4E 
##     SOCS3, VIM, THBS1, SELL, CXCL8, LTB, SGK1, TNFAIP3, MAL, IRF2BP2 
## Negative:  FCGR3A, CDKN1C, HES4, CSF1R, CKB, RHOC, ZNF703, MTSS1, TCF7L2, SIGLEC10 
##     MS4A7, FAM110A, HLA-DPA1, IFITM2, CTSL, PLD4, BATF3, SLC2A6, IFITM3, ABI3 
##     GZMB, NKG7, HLA-DPB1, FGFBP2, CTD-2006K23.1, CD79B, BID, LRRC25, GZMH, CTSC 
## PC_ 5 
## Positive:  GZMB, FGFBP2, GZMH, PRF1, CST7, NKG7, GNLY, KLRD1, GZMA, CTSW 
##     CCL5, SPON2, ADGRG1, PRSS23, KLRF1, CCL4, ZEB2, CLIC3, S1PR5, ITGAM 
##     HOPX, CEP78, CYBA, TRDC, C1orf21, MATK, TTC38, C12orf75, HLA-DRB1, HLA-DPB1 
## Negative:  IL7R, LTB, LEPROTL1, PAG1, MAL, CDKN1C, RCAN3, LEF1, TCF7, NOSIP 
##     CAMK4, LDHB, TOB1, TRABD2A, HES4, CSF1R, CKB, JUNB, TMEM123, ZFP36L2 
##     NELL2, RNASET2, CD28, BCL11B, TNFRSF25, CD27, CTSL, CD40LG, ZNF703, AQP3

**ⓘ Choosing parameters in PCA (genes and pcs)*
How many genes to choose for PCA and how many PCs to use for downstream analysis is a complex and important issue that is out of the scope of today’s workshop.

But we highly recommend you to read this document before analyzing your own scRNA-seq dat, where the authors show how to use some visualization methods to guide your choice.

2D Visualization

Using t-distributed stochastic neighbour embedding (t-SNE)

We run the t-SNE algorithm first. It is calculated based on PCs, we use top 30 PCs in this example:

pbmc_10x_v3 <- RunTSNE(pbmc_10x_v3, dims = 1:30)

We can then draw a t-SNE plot by using the Dimplot() function by specifying the argument reduction = 'tsne':

DimPlot(pbmc_10x_v3, reduction = "tsne", label = TRUE)

We can colour points (cells) by using other information by specyfing the argument group.by. For example, to display the sequencing platform:

DimPlot(pbmc_10x_v3, reduction = "tsne", label = TRUE, group.by = 'Method')

You can see all cells in pbmc_10x_v3 data are sequenced by using 10x Chromium(v3) platform.

You can try other numbers of PCs and see what changes?

pbmc_10x_v3 <- RunTSNE(pbmc_10x_v3, dims = 1:50)
DimPlot(pbmc_10x_v3, reduction = "tsne", label = TRUE, group.by = 'Experiment')

ⓘ Interpreting t-SNE results As presented earlier, t-SNE interpretation must be taken with caution. Nowadays using UMAP visulization is more appropriate.

Using Uniform manifold approximation and project (UMAP)

UMAP is shown based on the PCs. Here we use the top 30 PCs:

pbmc_10x_v3 <- RunUMAP(pbmc_10x_v3, dims=1:30)

We draw the UMAP plot with the Dimplot() function by specifying the argument reduction = 'umap':

DimPlot(pbmc_10x_v3, reduction = "umap", label = TRUE, group.by = 'Method')

DimPlot(pbmc_10x_v3, reduction = "umap", label = TRUE, group.by = 'Experiment')

Clustering

By default, Seurat uses the Louvain algorithm.

The Louvain algorithm requires a neighbor graph as input. Therefore, we first run the FindNeighbors() function first. Note that FindNeighbors() is also based on PCs, here we use all 30 top PCs:

pbmc_10x_v3 <- FindNeighbors(pbmc_10x_v3, dims = 1:30)

We then run the FindClusters() function for clustering.

Argument Algorithm=1 means we are using the Louvain algorithm for clustering.

Other options are available such as Algorithm=4 for the Leiden algorithm, but you have to install Python and some Python packages first. You can also try different resolution for more or less clusters. See ??FindClusters for more details.

pbmc_10x_v3 <- FindClusters(object = pbmc_10x_v3, resolution = 0.3, algorithm=1)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 3222
## Number of edges: 134293
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9386
## Number of communities: 10
## Elapsed time: 0 seconds

We can use t-SNE and UMAP to visualize the clustering results. The argument group.by='seurat_clusters' means that we want to color the cells according to the clustering results:

DimPlot(pbmc_10x_v3, reduction = "tsne", label = TRUE, group.by = 'seurat_clusters')

DimPlot(pbmc_10x_v3, reduction = "umap", label = TRUE, group.by = 'seurat_clusters')

Discussion: Which do you think is better between t-SNE and UMAP visualization results? Why?

Challenge 1: Characterising cluster 3 based on gene expression

According to the UMAP representation above, we found that cluster 3 seems quite different from other cells and cell clusters. We can dig a bit further to characterize this cluster.

First, we look for maker genes that were significantly differential expressed in Cluster 3 compared with other clusters. We use the FindAllMarkers() function to identify marker genes for each cluster.

This command line takes some time to run:

pbmc_10x_v3.markers <- FindAllMarkers(pbmc_10x_v3, min.pct = .25, logfc.threshold = .25)

We then extract the top 5 marker of cluster 3 with the smallest p-values:

cluster3.markers <- pbmc_10x_v3.markers[which(pbmc_10x_v3.markers$cluster==3),]
cluster3.markers[1:5, ]

We then search these marker genes in this database website https://panglaodb.se/search.html. We need to remove the dash (‘-’) before searching.

Based on the result provided by the database, it looks like cluster 3 might be a cluster of B cells. We can check whether this is the case by checking the cell-type ground truth information pre-saved in the pbmc_10x_v3 object.

You can try to annotate other clusters by yourself. The process of annotating each cluster using marker genes is also known as manual cell type annotation. We will try to do it automatically in the next section.

**ⓘ Ground-truth about cell-types* Typically, cell-type information is unknown in single cell data, as this is what we are trying to find out! But publicly available datasets may have been previously annotated using the technique higlighted above, or using other reference datasets.

Visualize marker genes

We can use violin plots to visualize the expression of one marker genes across all cell types.

For example, we choose the most significant marker genes from Cluster 4.

cluster4.markers <- pbmc_10x_v3.markers[which(pbmc_10x_v3.markers$cluster==4),]
cluster4.markers[1:5, ]

The most significant marker gene of Cluster 4 is VCAN.

If we want to visualize a marker gene across all cell types, we can use the Idents() function, specifying that CellType should be shown on the x-axios before using the VlnPlot() function to draw the violin plot:

Idents(object = pbmc_10x_v3) <- "CellType"
VlnPlot(pbmc_10x_v3, features = 'VCAN')

Exercise: What is the interpretation of this output?

3. Automatic cell type annotation for pbmc_10x_v2

Assume we have all cell type annotation for pbmc_10x_v3 and that we have a new set of data (stored in pbmc_10x_v2 that we wish to annotate automatically.

Remember: pbmc_10x_v2 and pbmc_10x_v3 include different cells from different patients and sequenced using platforms.

Exercise: pre-processing and visualization for pbmc_10x_v2

Exercise: use the functions presented in Section 2 to normalise, select highly variable genes, scale, run a PCA and visualize the data using UMAP.

Cell-type annotation with Seurat based on pbmc_10x_v3 as a reference

In Seurat we can learn cell type annotation results from one scRNA-seq data to provide cell type annotation for another scRNA-seq dataset.

We use the FindTransferAnchors() function to predict which cells in two datasets are of the same cell type. Here we use pbmc_10x_v3 as the reference data set, and pbmc_10x_v2 is the query data.

We specify that we use the top 30 PCs:

anchors <- FindTransferAnchors(reference = pbmc_10x_v3, query = pbmc_10x_v2, 
                               dims = 1:30)

We then assign the cell-type of the cells in pbmc_10x_v2 using the TransferData() function:

predictions <- TransferData(anchorset = anchors, refdata = pbmc_10x_v3$CellType, 
                                 dims = 1:30)

Seurat will provide a table with the most likely cell type and the probability of each cell type. We assign the most likely cell type to the pbmc_10x_v2 object:

pbmc_10x_v2@meta.data$CellType_Prediction <- predictions$predicted.id 

We then use UMAP to visualize this annotation:

DimPlot(pbmc_10x_v2, reduction = "umap", label = FALSE, group.by = 'CellType_Prediction')

Alternative: Cell-type annotation using Azimuth (a website tool)

Azimuth is a web application that uses an annotated reference dataset to automate the processing, analysis, and interpretation of a new single-cell RNA-seq or ATAC-seq experiment.

The input of Azimuth can be a Seurat object. In order to reduce the size of the uploaded file, we retain only the useful information for cell type annotation with the following command lines:

DefaultAssay(pbmc_10x_v2) <- "RNA"
pbmc_10x_v2_simple <- DietSeurat(object = pbmc_10x_v2, assays = "RNA")
saveRDS(pbmc_10x_v2_simple, 'pbmc_10x_v2.Rds')

An Rds file called pbmc_10x_v2.Rds is saved in your working directory. You can check where is your working directory by using the getwd() function.

Exercise Open the Azimuth website: https://azimuth.hubmapconsortium.org/.

To do cell type annotation (see also the slides here):

  1. Find ‘References for scRNA-seq Queries’ -> Then find ‘Human - PBMC’ -> click ‘Go to App’

  2. Click ‘Browse’ -> find ‘pbmc_10x_v2.Rds’ at your working directory -> Click ‘Open’

  3. Waiting for the Rds file upload to the website

  4. Click ‘Map cells to reference’

  5. Click ‘Download Results’

  6. Find ‘Predicted cell types and scores (TSV)’

  7. Click ‘Download’ to get the cell type annotation result stored in azimuth_pred.tsv

  8. Copy the tsv file (azimuth_pred.tsv) to your R working directory

The tsv file has the same data structure of Seurat annotation result (predictions). We read the tsv file, then add the annotation result to the pbmc_10x_v2 object with the AddMetaData() function:

azimuth_predictions <- read.delim('azimuth_pred.tsv', row.names = 1)
pbmc_10x_v2 <- AddMetaData(object = pbmc_10x_v2, metadata = azimuth_predictions)

We use UMAP to visualize the cell type annotation result from Azimuth.

DimPlot(pbmc_10x_v2, reduction = "umap", label = FALSE, group.by = 'predicted.celltype.l2')

Comparing annotations

Here is the cell type annotation results provided by the data provider:

DimPlot(pbmc_10x_v2, reduction = "umap", label = FALSE, group.by = 'CellType')

Exercise Assuming the ground truth of the cell type annotation is from the provider, which cell type annotation is better from Seurat or Azimuth? Why?

Solution:

The annotation provided by the Azimuth is better.

Because Azimuth uses a larger reference data set, it can better learn the characteristics of each cell type to obtain more accurate cell type annotations.

4. Integration of two data sets

We have combined the two data sets without any processing in the Seurat object pbmc_combo. In this section we will introduce how we can combine these data sets (note: the process would be similar if you are combining different patients).

Challenge 2: Why can’t we analyze pbmc_combo directly?

Exercise: use the functions presented in Section 2 to normalise, select highly variable genes, scale, run a PCA and visualize the data using UMAP on pbmc_combo. Highlight the issues in this basic analysis where we are combining two independent data sets.

Removing batch effects

In Seurat, we use the FindIntegrationAnchors() function to identify cells with similar biological information between two data sets. The difference between cells in two data sets with similar biological information is considered as batch effect:

anchor_combo <- FindIntegrationAnchors(object.list = list(pbmc_10x_v2, pbmc_10x_v3), dims = 1:30)

We then use the IntegrateData() function to remove batch effects and integrate the two data sets:

pbmc_combo <- IntegrateData(anchorset = anchor_combo, dims = 1:30)

Exercise: Here is the code visualizing the batch corrected data using UMAP (need to scale the data first). What do you observe?

# Scaling
pbmc_combo.all.genes <- rownames(pbmc_combo)
pbmc_combo <- ScaleData(pbmc_combo, features = pbmc_combo.all.genes)

# PCA
pbmc_combo <- RunPCA(pbmc_combo, features = VariableFeatures(object = pbmc_combo))

# UMAP
pbmc_combo <- FindNeighbors(pbmc_combo, dims = 1:30)
pbmc_combo <- RunUMAP(pbmc_combo, dims=1:30)
DimPlot(pbmc_combo, reduction = "umap", label = FALSE, group.by = 'Method')

DimPlot(pbmc_combo, reduction = "umap", label = FALSE, group.by = 'Experiment')

DimPlot(pbmc_combo, reduction = "umap", label = FALSE, group.by = 'CellType')

Solution:

Data points for the same cell type from different sequencing platforms changed from two clusters to one cluster. The algorithm successfully discovered and removed batch effects.